Multimodal Vision Transformers With Forced Attention For Behavior Analysis